Can this model be used to predict ozone if we collect new data in the future?
Uncertainty
Confidence interval: uncertainty in the mean response at a given predictor value.
Prediction interval: uncertainty in a single response at a given predictor value.
What it means
95% confidence interval: Given the parameters of the model, we are 95% confident that the mean response at a given predictor value is between y_1 and y_2.
95% prediction interval: Given the parameters of the model, we are 95% confident that a single response at a given predictor value is between y_1 and y_2.
Equations
The equation to calculate any prediction interval is:

\hat{y}_0 \pm t_{\alpha/2,\, n-2} \sqrt{MSE \left(1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2}\right)}

where:
x_0 is the value of the predictor for which we want a response
MSE is the mean squared error of the fit
\sum_{i=1}^n (x_i - \bar{x})^2 is the sum of squares of the predictor values (SS_{xx})
n is the number of observations
\bar{x} is the mean of the predictor values
The prediction interval formula has one additional term inside the square root (the 1, i.e. an extra 1 \cdot MSE). Beyond the uncertainty in estimating the mean response, there is additional variability in any single response (equivalent to the MSE). Thus the confidence interval is always narrower than the prediction interval.
Predictions in R
First, we need to create a new data frame with the predictor values we want to predict at – it must include all variables in the fitted model.
Use multiple metrics to test prediction quality, and always plot the predicted vs observed.
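As a minimal sketch of how this looks with predict() (using R's built-in cars dataset as a stand-in; the same pattern applies to any lm() fit):

```r
# Fit a simple linear model on a built-in dataset
fit <- lm(dist ~ speed, data = cars)

# The new data frame must contain every predictor in the fitted model
new_dat <- data.frame(speed = c(10, 20))

ci <- predict(fit, newdata = new_dat, interval = "confidence")
pi <- predict(fit, newdata = new_dat, interval = "prediction")

ci  # fit, lwr, upr columns for the mean response
pi  # same fitted values, but wider lwr/upr for a single new response
```

The fitted values are identical in both cases; only the interval widths differ, with the prediction interval always the wider of the two.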
Example: Loyn dataset
We will go through several examples to practice data splitting, cross-validation, and model evaluation.
About
Data on the relationship between bird abundance (birds ha^-1) and the characteristics of forest patches at 56 locations in SE Victoria.
The predictor variables are:
ALT Altitude (m)
YR.ISOL Year when the patch was isolated (years)
GRAZE Grazing (coded 1-5 which is light to heavy)
AREA Patch area (ha)
DIST Distance to nearest patch (km)
LDIST Distance to largest patch (km)
Code
loyn <- read_csv("images/loyn.csv")
Dataset splitting
We will split the data into training and test sets.
As the dataset is quite small, we will use an 80:20 split.
Code
set.seed(100)
indexes <- sample(1:nrow(loyn), size = 0.2 * nrow(loyn)) # randomly sample 20% of rows in the dataset
loyn_train <- loyn[-indexes, ] # remove the 20% - training dataset
loyn_test <- loyn[indexes, ] # select the 20% - test dataset
Checking the split
Check out the str() of the data to see if the split worked (number of observations).
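A sketch of this check (a stand-in data frame is used here so the code is self-contained; with the real data, simply run str() on the loyn_train and loyn_test objects created above):

```r
# Stand-in for the loyn dataset: 56 rows, as in the real data
loyn <- data.frame(ABUND = rnorm(56), AREA = runif(56, 1, 100))

set.seed(100)
indexes <- sample(1:nrow(loyn), size = 0.2 * nrow(loyn))
loyn_train <- loyn[-indexes, ]
loyn_test <- loyn[indexes, ]

str(loyn_train) # should report roughly 80% of the observations
str(loyn_test)  # should report roughly 20% of the observations
```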
Looks like AREA, LDIST and DIST are skewed – we will transform them so that they are more normally distributed.
Transforming predictors
We will use log10() to transform the predictors. The mutate() function from the dplyr package is useful for this as it can create new columns in the data frame with the transformed values.
Then, remove the untransformed variables from the dataset. Here we can use the select() function from the dplyr package to deselect columns by using the - sign.
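A sketch of these two steps together, assuming the _L10 column names that appear in the model output below:

```r
library(dplyr)

loyn_train <- loyn_train %>%
  mutate(AREA_L10  = log10(AREA),   # new log10-transformed columns
         DIST_L10  = log10(DIST),
         LDIST_L10 = log10(LDIST)) %>%
  select(-AREA, -DIST, -LDIST)      # deselect the untransformed originals
```

The same transformation must also be applied to loyn_test, so that the test data matches the variables used to fit the model.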
We start with a full model that includes all the predictors.
Code
full_fit <- lm(ABUND ~ ., data = loyn_train)
summary(full_fit)
Call:
lm(formula = ABUND ~ ., data = loyn_train)
Residuals:
Min 1Q Median 3Q Max
-17.3445 -3.4647 0.1991 2.8689 14.1844
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -159.27533 109.13660 -1.459 0.1527
YR.ISOL 0.09334 0.05392 1.731 0.0916 .
GRAZE -1.40912 1.03653 -1.359 0.1820
ALT 0.01657 0.02810 0.589 0.5590
AREA_L10 8.09629 1.78591 4.533 5.63e-05 ***
LDIST_L10 2.05115 3.23927 0.633 0.5304
DIST_L10 -6.18596 4.83189 -1.280 0.2082
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 6.497 on 38 degrees of freedom
Multiple R-squared: 0.6752, Adjusted R-squared: 0.6239
F-statistic: 13.17 on 6 and 38 DF, p-value: 5.277e-08
Assumptions - Round 1
As usual, we should check the assumptions of the model (CLINE + outliers). We will use the check_model() function from the performance package (because it looks nice and includes interpretation instructions).
We check multicollinearity with variance inflation factors (VIF) - VIFs are all < 10, so there is no evidence of multicollinearity. All assumptions are thus met.
Code
check_model(full_fit, check = c("vif"))
Backwards stepwise selection
Use the step() function to perform backwards stepwise selection. This function uses AIC to select the best model.
Depending on the dataset splitting, the best model may be different each time we randomly sample the data. In this case we should all have the same results as we set the seed.
If we compare to the full model, the adjusted r-squared is slightly higher, and the AIC is lower.
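A sketch of the selection step (shown here on R's built-in mtcars data so it is self-contained; with the real data, pass full_fit instead):

```r
# Full model with all predictors
full <- lm(mpg ~ ., data = mtcars)

# Drop terms one at a time while AIC keeps improving
step_fit <- step(full, direction = "backward", trace = 0)

summary(step_fit)       # compare adjusted R-squared with the full model
AIC(full, step_fit)     # lower AIC indicates the preferred model
```

Setting trace = 0 suppresses the step-by-step output; leave it at the default to watch which term is dropped at each step.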
p1 <- ggplot(loyn_train, aes(ABUND, pred)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  labs(x = "Observed ABUND", y = "Predicted ABUND", title = "Training") +
  theme_classic()
p2 <- ggplot(loyn_test, aes(ABUND, pred)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, color = "red") +
  labs(x = "Observed ABUND", y = "Predicted ABUND", title = "Test") +
  theme_classic()
p1 + p2
Calculating metrics - error
Generally the training dataset will have lower error and higher linearity than the test dataset. If this difference is very large, it suggests the model is overfitting and does not generalise to the test data.
With error – lower is better. The following all measure error in the same units as the response variable (number of birds in each forest patch).
We expect the ME (mean error) for the training dataset to be near 0 – in a well-fitted linear regression model, positive and negative residuals balance each other out.
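A sketch of these metrics in base R (shown on a built-in dataset so it is self-contained; with the real data, use the pred columns added to loyn_train and loyn_test):

```r
# Fit and predict on a stand-in dataset
fit <- lm(dist ~ speed, data = cars)
pred <- predict(fit)
obs <- cars$dist

me   <- mean(pred - obs)             # mean error (bias); ~0 on training data
mae  <- mean(abs(pred - obs))        # mean absolute error
rmse <- sqrt(mean((pred - obs)^2))   # root mean squared error

c(ME = me, MAE = mae, RMSE = rmse)   # all in the units of the response
```

Computing the same three values on the test set and comparing them against the training values is a quick way to gauge how much performance degrades on unseen data.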